STA 101L - Summer I 2022
Raphael Morsomme
simple linear regression model \[ Y \approx \beta_0 + \beta_1 X \]
multiple linear regression model \[ Y \approx \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p \]
categorical predictor
feature engineering
Earlier, we saw that adding the predictor \(\dfrac{1}{\text{displ}}\) gave a better fit.
Let us see if the same idea works with the trees dataset.
Suppose we want to predict volume using only the variable girth.
One could argue that there is a slight nonlinear association
We start with the simple model
\[ \text{Volume} \approx \beta_0 + \beta_1 \text{girth} \]
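The fitting command for this simple model is not shown on the slide; a minimal sketch, assuming d_tree is R's built-in trees data frame:

```r
# Sketch: fit the simple model Volume ~ Girth
# (d_tree assumed to be the built-in trees data)
d_tree <- trees
m1 <- lm(Volume ~ Girth, data = d_tree)
summary(m1)$r.squared  # about 0.935
```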
Girth_new <- seq(min(d_tree$Girth), max(d_tree$Girth), by = 0.001)
new_data <- tibble(Girth = Girth_new)
Volume_pred <- predict(m1, new_data)
new_data <- mutate(new_data, Volume_pred = Volume_pred)
head(new_data)
# A tibble: 6 x 2
Girth Volume_pred
<dbl> <dbl>
1 8.3 5.10
2 8.30 5.11
3 8.30 5.11
4 8.30 5.12
5 8.30 5.12
6 8.31 5.13
To capture the nonlinear association between Girth and Volume, we consider the predictor \(\text{Girth}^2\).
\[ \text{Volume} \approx \beta_0 + \beta_1 \text{girth} + \beta_2 \text{girth}^2 \]
The command to fit this model is
d_tree2 <- mutate(d_tree, Girth2 = Girth^2)
m2 <- lm(Volume ~ Girth + Girth2, data = d_tree2)
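The helper compute_R2 is not defined on the slides; a minimal version consistent with how it is called here:

```r
# Hypothetical definition of the compute_R2 helper used on these slides:
# extract R^2 from a fitted lm object
compute_R2 <- function(model) summary(model)$r.squared
```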
compute_R2(m2)
[1] 0.962
\(R^2\) has increased! It went from 0.935 (model 1) to 0.962.
Girth_new <- seq(min(d_tree$Girth), max(d_tree$Girth), by = 0.001)
new_data <- tibble(Girth = Girth_new) %>% mutate(Girth2 = Girth^2)
Volume_pred <- predict(m2, new_data)
new_data <- mutate(new_data, Volume_pred = Volume_pred)
head(new_data)
# A tibble: 6 x 3
Girth Girth2 Volume_pred
<dbl> <dbl> <dbl>
1 8.3 68.9 11.0
2 8.30 68.9 11.0
3 8.30 68.9 11.0
4 8.30 68.9 11.0
5 8.30 69.0 11.0
6 8.31 69.0 11.0
Let us consider the predictor \(\text{Girth}^3\).
\[ \text{Volume} \approx \beta_0 + \beta_1 \text{girth} + \beta_2 \text{girth}^2 + \beta_3 \text{girth}^3 \]
d_tree3 <- mutate(d_tree, Girth2 = Girth^2, Girth3 = Girth^3)
m3 <- lm(Volume ~ Girth + Girth2 + Girth3, data = d_tree3)
compute_R2(m3)
[1] 0.963
\(R^2\) has increased! It went from 0.962 (model 2) to 0.963.
Girth_new <- seq(min(d_tree$Girth), max(d_tree$Girth), by = 0.001)
new_data <- tibble(Girth = Girth_new) %>% mutate(Girth2 = Girth^2, Girth3 = Girth^3)
Volume_pred <- predict(m3, new_data)
new_data <- mutate(new_data, Volume_pred = Volume_pred)
head(new_data)
# A tibble: 6 x 4
Girth Girth2 Girth3 Volume_pred
<dbl> <dbl> <dbl> <dbl>
1 8.3 68.9 572. 9.88
2 8.30 68.9 572. 9.88
3 8.30 68.9 572. 9.89
4 8.30 68.9 572. 9.89
5 8.30 69.0 573. 9.89
6 8.31 69.0 573. 9.90
What if we also include the predictors \(\text{Girth}^4, \text{Girth}^5, \dots, \text{Girth}^{k}\) for some larger number \(k\)?
The following R code fits such a model with \(k=34\), that is,
\[ \text{Volume} \approx \beta_0 + \beta_1 \text{girth} + \beta_2 \text{girth}^2 + \dots + \beta_{34} \text{girth}^{34} \]
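The R code itself is omitted on the slide; a sketch, assuming d_tree is the trees data and using poly() to generate the powers of Girth:

```r
# Sketch of the omitted fitting command (d_tree assumed to be the trees data);
# with only 31 observations, most of these 34 coefficients are not estimable
d_tree <- trees
m_extreme <- lm(Volume ~ poly(Girth, degree = 34, raw = TRUE), data = d_tree)
summary(m_extreme)$r.squared
```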
[1] 0.98
\(R^2\) has again increased! It went from 0.963 (model 3) to 0.98.
As we keep adding predictors, \(R^2\) will always increase.
Is the model m_extreme a good model?
Note
We want to learn about the relation between Volume and Girth present in the population (not the sample).
A good model should accurately represent the population.
Girth_new <- seq(min(d_tree$Girth), max(d_tree$Girth), by = 0.00001)
new_data <- tibble(Girth = Girth_new)
Volume_pred <- predict(m_extreme, new_data)
new_data <- mutate(new_data, Volume_pred = Volume_pred)
head(new_data)
# A tibble: 6 x 2
Girth Volume_pred
<dbl> <dbl>
1 8.3 10.3
2 8.30 10.3
3 8.30 10.3
4 8.30 10.3
5 8.30 10.3
6 8.30 10.3
The model m_extreme overfits the data.
A model overfits the data when it fits the sample extremely well but does a poor job for new data.
Remember that we want to learn about the population, not the sample!
We saw that \(R^2\) keeps increasing as we add predictors.
\[ R^2 = 1 - \dfrac{SSR}{SST} \]
\(R^2\) can therefore not be used to identify models that overfit the data.
The adjusted-\(R^2\) is similar to \(R^2\), but penalizes large models:
\[ R^2_{\text{adj}} = 1 - \dfrac{SSR}{SST}\dfrac{n-1}{n-p-1} \]
where \(p\) corresponds to the number of predictors.
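The formula can be checked by hand against R's own computation; a sketch using the built-in trees data and the simple model:

```r
# Verify the adjusted-R^2 formula by hand (built-in trees data)
m1 <- lm(Volume ~ Girth, data = trees)
n <- nrow(trees)
p <- 1                                             # one predictor
SSR <- sum(residuals(m1)^2)                        # sum of squared residuals
SST <- sum((trees$Volume - mean(trees$Volume))^2)  # total sum of squares
R2_adj <- 1 - (SSR / SST) * (n - 1) / (n - p - 1)
R2_adj  # matches summary(m1)$adj.r.squared
```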
The adjusted-\(R^2\) therefore balances goodness of fit and parsimony:
The model with the highest adjusted-\(R^2\) typically provides a good fit without overfitting.
Two other popular criteria that balance goodness of fit (small SSR) and parsimony (small \(p\)) are
The formulas for the AIC and BIC are, respectively,
\[ AIC = 2p - \text{GoF}, \qquad BIC = \ln(n)\, p - \text{GoF} \]
where \(\text{GoF}\) measures the goodness of fit of the model.
Unlike the adjusted-\(R^2\), smaller is better for the AIC and BIC.
Note that the BIC penalizes the number of parameters \(p\) more strongly.
These are easily accessible in R with the command glance from the broom package. For instance, applying glance to m1, m2, m3 and m_extreme gives
# A tibble: 1 x 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.935 0.933 4.25 419. 8.64e-19 1 -87.8 182. 186.
# ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
# A tibble: 1 x 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.962 0.959 3.33 350. 1.52e-20 2 -79.7 167. 173.
# ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
# A tibble: 1 x 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.963 0.959 3.35 232. 2.17e-19 3 -79.3 169. 176.
# ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
# A tibble: 1 x 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.980 0.960 3.30 48.6 5.64e-10 15 -69.7 173. 198.
# ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
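Base R can also return these criteria directly via the AIC() and BIC() functions; a sketch on the trees data:

```r
# Comparing models with base R's AIC() and BIC() (trees data assumed)
m1 <- lm(Volume ~ Girth, data = trees)
m2 <- lm(Volume ~ poly(Girth, degree = 2, raw = TRUE), data = trees)
AIC(m1); AIC(m2)  # smaller is better
BIC(m1); BIC(m2)
```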
In this case, the AIC and BIC favor m2, while the adjusted-\(R^2\) (wrongly) favors m_extreme.
The adjusted-\(R^2\), AIC and BIC all try to balance goodness of fit and parsimony.
They achieve this balance by favoring models with small SSR while penalizing models with larger \(p\).
But these criteria are not foolproof: the adjusted-\(R^2\) favored m_extreme, although it was clearly overfitting the data. Instead of using such criteria, we could look for the model with the best prediction performance.
That is, the model that makes predictions for new observations that are the closest to the true values.
We will learn two approaches to accomplish this: the holdout method and cross-validation.
The holdout method is a simple method to evaluate the predictive performance of a model.
Note that the test set consists of new observations for the model.
\(\Rightarrow\) Select the model with the best prediction accuracy in step 3.
The holdout method.
Source: IMS
The following R function splits a sample into a training and a test set
construct_training_test <- function(sample, prop_training = 2/3){
n <- nrow(sample)
n_training <- round(n * prop_training)
sample_random <- slice_sample(sample, n = n)
sample_training <- slice(sample_random, 1 : n_training )
sample_test <- slice(sample_random, - (1 : n_training))
return(list(
training = sample_training, test = sample_test
))
}
   Girth Height Volume
1 8.3 70 10.3
2 8.6 65 10.3
3 8.8 63 10.2
4 10.5 72 16.4
5 10.7 81 18.8
6 10.8 83 19.7
7 11.0 66 15.6
8 11.0 75 18.2
9 11.1 80 22.6
10 11.2 75 19.9
11 11.3 79 24.2
12 11.4 76 21.0
13 11.4 76 21.4
14 11.7 69 21.3
15 12.0 75 19.1
16 12.9 74 22.2
17 12.9 85 33.8
18 13.3 86 27.4
19 13.7 71 25.7
20 13.8 64 24.9
21 14.0 78 34.5
22 14.2 80 31.7
23 14.5 74 36.3
24 16.0 72 38.3
25 16.3 77 42.6
26 17.3 81 55.4
27 17.5 82 55.7
28 17.9 80 58.3
29 18.0 80 51.5
30 18.0 80 51.0
31 20.6 87 77.0
set.seed(0)
training_test_sets <- construct_training_test(d_tree)
training_set <- training_test_sets[["training"]]
test_set <- training_test_sets[["test"]]
training_set
   Girth Height Volume
1 11.7 69 21.3
2 16.3 77 42.6
3 10.5 72 16.4
4 11.0 66 15.6
5 8.3 70 10.3
6 8.6 65 10.3
7 14.5 74 36.3
8 11.3 79 24.2
9 20.6 87 77.0
10 13.3 86 27.4
11 13.7 71 25.7
12 17.5 82 55.7
13 11.2 75 19.9
14 18.0 80 51.0
15 14.0 78 34.5
16 17.9 80 58.3
17 11.1 80 22.6
18 10.7 81 18.8
19 14.2 80 31.7
20 12.0 75 19.1
21 11.4 76 21.0
We simply fit our regression model to the training set.
To evaluate the prediction accuracy, we start by computing the predictions for the observations in the test set.
A good model will make predictions that are close to the true values of the response variable.
A good measure of prediction accuracy is the sum of squared errors (SSE).
\[ SSE = (y_1 - \hat{y}_1)^2 + (y_2 - \hat{y}_2)^2 + \dots + (y_m - \hat{y}_m)^2 \]
Small SSE is better.
In practice, the (root) mean squared error, or (R)MSE, is often used.
\[ MSE = \dfrac{SSE}{m} = \dfrac{(y_1 - \hat{y}_1)^2 + (y_2 - \hat{y}_2)^2 + \dots + (y_m - \hat{y}_m)^2}{m} \]
\[ RMSE = \sqrt{\dfrac{SSE}{m}} = \sqrt{\dfrac{(y_1 - \hat{y}_1)^2 + (y_2 - \hat{y}_2)^2 + \dots + (y_m - \hat{y}_m)^2}{m}} \] Again, small (R)MSE is better.
m1 <- lm(Volume ~ Girth, data = training_set) # fit on the training set
Volume_hat <- predict(m1, test_set)           # predictions for the test set
Volume <- test_set$Volume                     # true values
SSE <- sum((Volume - Volume_hat)^2)
m <- nrow(test_set)
MSE <- SSE / m
RMSE <- sqrt(MSE)
SSE; MSE; RMSE
[1] 225.2413
[1] 22.52413
[1] 4.745959
Tip
(R)MSE
The advantage of the (R)MSE over the SSE is that we can compare models evaluated on test sets of different sizes.
Apply steps 2 and 3 on different models.
Simply choose the model with the lowest (R)MSE.
This model has the best out-of-sample accuracy.
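The helper compute_RMSE is not defined on the slides; a minimal version consistent with how it is called below, assuming the response variable is Volume:

```r
# Hypothetical definition of the compute_RMSE helper used below:
# RMSE of a fitted model on a test set whose response is Volume
compute_RMSE <- function(test_set, model){
  Volume_hat <- predict(model, test_set)        # predictions for the test set
  sqrt(mean((test_set$Volume - Volume_hat)^2))  # root mean squared error
}
```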
# Step 2
m1 <- lm(Volume ~ poly(Girth, degree = 1 , raw = TRUE), data = training_set)
m2 <- lm(Volume ~ poly(Girth, degree = 2 , raw = TRUE), data = training_set)
m3 <- lm(Volume ~ poly(Girth, degree = 3 , raw = TRUE), data = training_set)
m48 <- lm(Volume ~ poly(Girth, degree = 48, raw = TRUE), data = training_set)
# Step 3
compute_RMSE(test_set, m1 )
[1] 4.745959
compute_RMSE(test_set, m2 )
[1] 4.16503
compute_RMSE(test_set, m3 )
[1] 4.093268
compute_RMSE(test_set, m48)
[1] 13.47366
We select m3.
The previous analysis was conducted with set.seed(0).
Models m2 and m3 were pretty close.
Could we have obtained a different result with a different seed?
# Step 1
set.seed(1) # new seed
training_test_sets <- construct_training_test(d_tree)
training_set <- training_test_sets[["training"]]
test_set <- training_test_sets[["test"]]
# Step 2
m1 <- lm(Volume ~ poly(Girth, degree = 1 , raw = TRUE), data = training_set)
m2 <- lm(Volume ~ poly(Girth, degree = 2 , raw = TRUE), data = training_set)
m3 <- lm(Volume ~ poly(Girth, degree = 3 , raw = TRUE), data = training_set)
m48 <- lm(Volume ~ poly(Girth, degree = 48, raw = TRUE), data = training_set)
# Step 3
compute_RMSE(test_set, m1 )
[1] 6.589842
compute_RMSE(test_set, m2 )
[1] 4.332965
compute_RMSE(test_set, m3 )
[1] 5.437581
compute_RMSE(test_set, m48)
[1] 310105724
A drawback of the holdout method is that the test set matters a lot.
Repeating steps 2 and 3 with a different partition from step 1 may give different results.
Cross-validation (CV) is the natural generalization of the holdout method – you simply repeat the holdout method many times.
Repeat the following steps \(k\) times
Let fold 1 be the test set and the remaining folds be the training set.
Fit the model to the training set (like the holdout method)
Evaluate the prediction accuracy of the model on the test set (like the holdout method)
Go back to 1, letting the next fold be the test set.
\(\Rightarrow\) Select the model with the best overall prediction accuracy in step 3.
5-fold cross-validation
Source: towardsdatascience
set.seed(0)
# Setup
n <- nrow(d_tree)
n_folds <- 10
# Fold assignment
folds <- c(rep(1:n_folds, n %/% n_folds), seq_len(n %% n_folds)) # seq_len labels the leftover rows 1, 2, ...
folds
 [1]  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5
[26] 6 7 8 9 10 1
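The command creating d_tree_fold is not shown on the slide; a sketch, assuming d_tree is the trees data and using the fold labels above:

```r
# Sketch of the omitted step: attach fold labels to a shuffled copy of the data
# (d_tree assumed to be the built-in trees data)
library(dplyr)
d_tree <- as_tibble(trees)
n <- nrow(d_tree); n_folds <- 10
folds <- c(rep(1:n_folds, n %/% n_folds), seq_len(n %% n_folds))
d_tree_fold <- d_tree %>%
  slice_sample(n = n) %>%   # shuffle rows so fold membership is random
  mutate(fold = folds)
```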
head(d_tree_fold)
  Girth Height Volume fold
1 11.7 69 21.3 1
2 16.3 77 42.6 2
3 10.5 72 16.4 3
4 11.0 66 15.6 4
5 8.3 70 10.3 5
6 8.6 65 10.3 6
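The helper create_empty_RMSE is likewise undefined on the slides; a minimal version matching its use in the loop below:

```r
# Hypothetical helper: an empty tibble with one RMSE column per model,
# to be filled row by row, one row per fold
library(tibble)
create_empty_RMSE <- function(){
  tibble(rmse_m1  = double(), rmse_m2  = double(),
         rmse_m3  = double(), rmse_m48 = double())
}
```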
RMSE <- create_empty_RMSE()
for(i in 1:n_folds){
# Step 1
training_set <- filter(d_tree_fold, fold != i)
test_set <- filter(d_tree_fold, fold == i)
# Step 2
m1 <- lm(Volume ~ poly(Girth, degree = 1 , raw = TRUE), data = training_set)
m2 <- lm(Volume ~ poly(Girth, degree = 2 , raw = TRUE), data = training_set)
m3 <- lm(Volume ~ poly(Girth, degree = 3 , raw = TRUE), data = training_set)
m48 <- lm(Volume ~ poly(Girth, degree = 48, raw = TRUE), data = training_set)
# Step 3
rmse_m1 <- compute_RMSE(test_set, m1 )
rmse_m2 <- compute_RMSE(test_set, m2 )
rmse_m3 <- compute_RMSE(test_set, m3 )
rmse_m48 <- compute_RMSE(test_set, m48)
RMSE <- add_row(RMSE, rmse_m1, rmse_m2, rmse_m3, rmse_m48)
}
Which model has the lowest RMSE across all folds?
# A tibble: 10 x 4
rmse_m1 rmse_m2 rmse_m3 rmse_m48
<dbl> <dbl> <dbl> <dbl>
1 5.01 3.28 3.40 4.28
2 2.95 2.52 2.77 5.15
3 3.25 4.73 4.61 6.50
4 3.82 4.18 4.44 3.27
5 3.41 2.08 2.03 204.
6 3.91 2.61 2.71 52.0
7 5.13 4.08 4.00 3.03
8 3.19 4.01 3.78 2.56
9 7.09 1.89 2.39 10911566.
10 5.17 3.47 3.36 3.86
Group exercise - cross-validation
Exercise 25.9 (the book uses SSE for our SSR).
In part c, replace left and right with top and bottom.
summarize_all(RMSE, mean)
# A tibble: 1 x 4
rmse_m1 rmse_m2 rmse_m3 rmse_m48
<dbl> <dbl> <dbl> <dbl>
1 4.29 3.29 3.35 1091185.
Model 2 has a slightly better average RMSE.
Leave-one-out cross-validation (LOOCV): set \(k=n\); that is, we use \(n\) folds, each of size \(1\). Each test set then consists of a single observation and each training set of \(n-1\) observations.
RMSE <- create_empty_RMSE()
# Step 0
n <- nrow(d_tree)
d_tree_loocv <- mutate(d_tree, folds = 1:n)
for(i in 1:n){
# Step 1
training_set <- filter(d_tree_loocv, folds != i)
test_set <- filter(d_tree_loocv, folds == i)
# Step 2
m1 <- lm(Volume ~ poly(Girth, degree = 1 , raw = TRUE), data = training_set)
m2 <- lm(Volume ~ poly(Girth, degree = 2 , raw = TRUE), data = training_set)
m3 <- lm(Volume ~ poly(Girth, degree = 3 , raw = TRUE), data = training_set)
m48 <- lm(Volume ~ poly(Girth, degree = 48, raw = TRUE), data = training_set)
# Step 3
rmse_m1 <- compute_RMSE(test_set, m1 )
rmse_m2 <- compute_RMSE(test_set, m2 )
rmse_m3 <- compute_RMSE(test_set, m3 )
rmse_m48 <- compute_RMSE(test_set, m48)
RMSE <- add_row(RMSE, rmse_m1, rmse_m2, rmse_m3, rmse_m48)
}
summarize_all(RMSE, mean)
# A tibble: 1 x 4
rmse_m1 rmse_m2 rmse_m3 rmse_m48
<dbl> <dbl> <dbl> <dbl>
1 3.64 2.88 2.87 7224032.
Not my favorite method, but one you should learn because it is widely used.
There are two types of stepwise selection procedures: (i) forward and (ii) backward.
Choose a model selection criterion that balances goodness of fit (small SSR) and parsimony (small \(p\))
\(\Rightarrow\) adjusted-\(R^2\), AIC or BIC
Start with the empty model \(Y \approx \beta_0\), i.e. the model with no predictors. This is our current model.
Fit all possible models with one additional predictor.
Compute the AIC of each of these models.
Identify the model with the smallest AIC. This is our candidate model.
If the AIC of the candidate model is larger than the AIC of the current model (no new model improves on the current one), the procedure stops and we select the current model. Otherwise, the candidate model becomes the current model and we repeat from step 1.
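The procedure above can be automated with base R's step() function; a sketch on the trees data, assuming Girth and Height as the candidate predictors (the exercise's variables may differ):

```r
# Forward selection by AIC with step() (trees data assumed)
m_empty <- lm(Volume ~ 1, data = trees)   # start from the empty model
m_forward <- step(m_empty, direction = "forward",
                  scope = ~ Girth + Height,  # largest model considered
                  trace = FALSE)
formula(m_forward)  # the selected model
```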
Group exercise - stepwise selection
Exercise 8.11. In addition, fit the candidate models in R with the command lm.
Similar to forward selection, except that we start with the full model (all predictors included) and remove one predictor at a time.
Note that forward and backward selection need not agree; they may select different models.
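Backward selection can likewise be sketched with step(), again assuming the trees data with predictors Girth and Height:

```r
# Backward selection by AIC: start from the full model and try dropping terms
m_backward <- step(lm(Volume ~ Girth + Height, data = trees),
                   direction = "backward", trace = FALSE)
formula(m_backward)  # the selected model
```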
Group exercise - stepwise selection
Exercise 8.11. In addition, fit the current and candidate models in R with the command lm.
Parsimony
To obtain a parsimonious model